DS 6030 | Fall 2023 | University of Virginia

Final Project: Generalized Additive Models (GAM)

Author

Rachel Holman, Grace Zhang, Rose E.M.

Published

December 5, 2023

library(tidyverse)    # functions for data manipulation  
library(ggplot2)
library(MASS)
library(dplyr)
library(GGally)
library(kableExtra)
library(DT)
library(pROC)
library(caret)

Introduction to Generalized Additive Models (GAM)

Bike-sharing systems represent a contemporary evolution in rental services, automating membership processes and offering users the convenience to pick up and return bikes at their convenience. With over 500 programs worldwide and a fleet exceeding half a million bicycles, these systems have garnered widespread attention for their impact on traffic, environmental sustainability, and public health.

Given this distinctive evolution in bike-sharing systems and the intricate patterns of city mobility, one of our goals is to fit linear and logistic Generalized Additive Models (GAMs) to uncover valuable insights: predicting the daily demand for bikes and estimating the probability of a given day being a holiday.

GLMs extend linear models to accommodate diverse data relationships, but they assume linearity on a transformed scale. In contrast, GAMs allow non-linear connections, offering more flexibility for intricate relationships. Hence, our second goal is to compare the advantages and disadvantages of GAMs with and without smoothing parameters.

Bike sharing has become a convenient mode of transportation in urban areas and in university towns such as Charlottesville. The data come from Fanaee-T and are available on the UCI ML web site or at this github link.

This dataset contains the hourly and daily count of rental bikes between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.

  • instant: record index

  • dteday : date

  • season : season (1:winter, 2:spring, 3:summer, 4:fall)

  • yr : year (0: 2011, 1:2012)

  • mnth : month ( 1 to 12)

  • hr : hour (0 to 23)

  • holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

  • weekday : day of the week

  • workingday : if day is neither weekend nor holiday is 1, otherwise is 0.

  • weathersit :

      - 1: Clear, Few clouds, Partly cloudy
    
      - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    
      - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    
      - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temperature : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)

  • atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)

  • humidity: Normalized humidity. The values are divided by 100 (max)

  • windspeed: Normalized wind speed. The values are divided by 67 (max)

  • casual: count of casual users

  • registered: count of registered users

  • count: count of total rental bikes including both casual and registered
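As a quick sanity check on these normalization formulas, one can invert the min-max scaling to recover raw units. A minimal base-R sketch using the temperature bounds listed above (the function name is ours, for illustration):

```r
# invert the min-max normalization used for temperature:
# raw = normalized * (t_max - t_min) + t_min, with t_min = -8, t_max = +39
denorm_temp <- function(x, t_min = -8, t_max = 39) x * (t_max - t_min) + t_min

celsius <- denorm_temp(0.40)        # 0.40 on the normalized scale
fahrenheit <- celsius * 9 / 5 + 32  # convert to Fahrenheit

celsius     # 10.8 degrees Celsius
fahrenheit  # 51.44 degrees Fahrenheit
```

This is why a normalized temperature of 0.40 is described later as roughly 51 degrees Fahrenheit.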

Specialty of GAM

GAMs can be used on datasets in which the predictors are associated with the response variable. They are just as versatile as GLMs because they can model both numeric and categorical response variables. In addition, GAMs can model features that have a nonlinear relationship with the response variable.

Advantages and Disadvantages

The primary advantage of GAM is its ability to model highly complex nonlinear relationships when the number of potential predictors is large. The main disadvantage of GAM is its computational complexity; like other nonparametric methods, GAM has a high propensity for overfitting. The table below shows a clear outline of the biggest advantages and disadvantages of GAM.

Advantages of GAMs:

  1. Flexibility: GAMs can model various relationships, including non-linear and complex patterns.

  2. Interpretability: They provide interpretable results, making it easier to understand the relationships between predictors and the response.

  3. Non-linearity: GAMs can capture intricate, non-linear relationships that traditional linear models cannot represent.

  4. Regularization: GAMs can incorporate regularization techniques to prevent overfitting and improve generalization.

  5. Visualization: The smooth functions in GAMs can be represented visually, aiding model interpretation.

  6. Insensitivity to outliers: outliers do not have as strong a global influence on a GAM fit, even when they influence the fit locally.

Disadvantages of GAMs:

  1. Complexity: GAMs can become computationally intensive for large datasets or high-dimensional problems.

  2. Data requirements: GAMs may require larger sample sizes to capture non-linear patterns effectively.

  3. Sensitivity to smoothing parameters: the choice of smoothing parameters can impact model results, requiring careful tuning.

  4. Missing values: GAMs do not natively handle missing data; the user must select a method, and the default is to drop NA rows.

  5. Limited to regression and classification: GAMs are primarily suited for regression and classification tasks and may not be suitable for more complex tasks like image recognition.

  6. Wider confidence intervals and less powerful tests: as a tradeoff for their high flexibility, GAMs generally have wider confidence intervals than simpler models.

R and Python packages for GAMs

Documentation for GAM in R

Documentation for GAM in Python

Data Cleaning

Before performing GAM for linear regression, we loaded our dataset and checked whether there were any missing values, abnormal zeros, or NAs.

# import data
bikes <- read.csv('bikes_hour.csv')
datatable(bikes)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
#data type of variables
str(bikes)
'data.frame':   17379 obs. of  17 variables:
 $ instant    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday     : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
 $ season     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ yr         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ hour       : int  0 1 2 3 4 5 6 7 8 9 ...
 $ holiday    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday    : int  6 6 6 6 6 6 6 6 6 6 ...
 $ workingday : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weathersit : int  1 1 1 1 1 2 1 1 1 1 ...
 $ temperature: num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
 $ atemp      : num  0.288 0.273 0.273 0.288 0.288 ...
 $ humidity   : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
 $ windspeed  : num  0 0 0 0 0 0.0896 0 0 0 0 ...
 $ casual     : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered : int  13 32 27 10 1 1 0 2 7 6 ...
 $ count      : int  16 40 32 13 1 1 2 3 8 14 ...
# check NAs
colSums(is.na(bikes))
    instant      dteday      season          yr        mnth        hour 
          0           0           0           0           0           0 
    holiday     weekday  workingday  weathersit temperature       atemp 
          0           0           0           0           0           0 
   humidity   windspeed      casual  registered       count 
          0           0           0           0           0 
# check zeros
colSums(bikes == 0, na.rm = TRUE)
    instant      dteday      season          yr        mnth        hour 
          0           0           0        8645           0         726 
    holiday     weekday  workingday  weathersit temperature       atemp 
      16879        2502        5514           0           0           2 
   humidity   windspeed      casual  registered       count 
         22        2180        1581          24           0 

Then, we ran a quick exploratory data analysis by computing the correlation between the numeric variables that are not coded as categorical levels: yr, mnth, hour, weathersit, temperature, atemp, humidity, windspeed, casual, and count.

# correlation between numeric variables
num_corr_tb<- bikes %>% 
  #include dplyr::, otherwise dplyr would clash with MASS package
  dplyr::select(-season, -dteday, -instant, -holiday, -weekday,
                -workingday, -registered)

 
#num_corr_tb 
datatable(cor(num_corr_tb))

As the correlation table above shows, we did not see any strong correlations among these numeric variables.

GAM for Linear Regression (Predicting The Count of Bikes Rented)

GAM Fit

For GAMs we make use of the gam library in R, so the first thing we have to do is install the package by executing install.packages("gam") once. Then we load the library.

library(gam)
Loading required package: splines
Loading required package: foreach

Attaching package: 'foreach'
The following objects are masked from 'package:purrr':

    accumulate, when
Loaded gam 1.22-2
#library(mgcv)

The main function is gam(). Inside this function we can use any combination of non-linear and linear modelling of the various predictors. For example, below we use a smoothing spline for temperature and windspeed, and a simple linear term for the categorical variables workingday and season. We then plot the contribution of each predictor using the command plot(). As we can see, GAMs are very useful as they estimate the contribution of each predictor's effect separately.

# fit the GAM model for linear regression
gam_lin = gam( count ~ s(temperature) + s(windspeed) + 
                    workingday + season,
                  data = bikes )

summary(gam_lin)

Call: gam(formula = count ~ s(temperature) + s(windspeed) + workingday + 
    season, data = bikes)
Deviance Residuals:
   Min     1Q Median     3Q    Max 
-321.0 -112.0  -32.1   74.7  732.8 

(Dispersion Parameter for gaussian family taken to be 26818)

    Null Deviance: 5.72e+08 on 17378 degrees of freedom
Residual Deviance: 4.66e+08 on 17368 degrees of freedom
AIC: 226543 

Number of Local Scoring Iterations: NA 

Anova for Parametric Effects
                  Df   Sum Sq  Mean Sq F value Pr(>F)    
s(temperature)     1 9.26e+07 92559225  3451.3 <2e-16 ***
s(windspeed)       1 5.82e+06  5820723   217.0 <2e-16 ***
workingday         1 5.63e+04    56313     2.1   0.15    
season             1 3.09e+06  3094437   115.4 <2e-16 ***
Residuals      17368 4.66e+08    26818                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
               Npar Df Npar F   Pr(F)    
(Intercept)                              
s(temperature)       3   22.5 1.5e-14 ***
s(windspeed)         3   18.7 4.0e-12 ***
workingday                               
season                                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary above shows that the parametric effects of temperature, windspeed, and season are all significant predictors with p-values < 0.05. Additionally, the nonparametric effects of temperature and windspeed are significant, which shows there is a benefit to modeling these variables non-linearly with smoothing.

par( mfrow = c(1,4) )
plot( gam_lin,  se = TRUE, col = "blue" )

Note that simply using workingday and season inside gam() fits a linear model for these variables. However, the workingday variable is binary, taking only the values 0 and 1, and the season variable is categorical with values 1-4 representing the four seasons. We can see this from the x-axes of the workingday and season plots on the right above. So it would be preferable to use a step function for these variables, which requires converting them to factors. We first create a second object called bikes1 (in order not to change the initial dataset bikes) and then use the command factor() to convert the variables. Then we fit the same model again. As we can see below, gam() now fits a step function for workingday and season, which is more appropriate.

# convert workingday and season to factors
bikes1 = bikes
bikes1$workingday = factor(bikes1$workingday)
bikes1$season = factor(bikes1$season)

# fit the GAM model for linear regression with factorized workingday and season
gam1_lin = gam( count ~ s(temperature) + s(windspeed) + 
                    workingday + season,
                  data = bikes1 )
summary(gam1_lin)

Call: gam(formula = count ~ s(temperature) + s(windspeed) + workingday + 
    season, data = bikes1)
Deviance Residuals:
   Min     1Q Median     3Q    Max 
-329.9 -107.8  -32.0   72.5  773.9 

(Dispersion Parameter for gaussian family taken to be 25828)

    Null Deviance: 5.72e+08 on 17378 degrees of freedom
Residual Deviance: 4.49e+08 on 17366 degrees of freedom
AIC: 225891 

Number of Local Scoring Iterations: NA 

Anova for Parametric Effects
                  Df   Sum Sq  Mean Sq F value Pr(>F)    
s(temperature)     1 9.25e+07 92502263 3581.50 <2e-16 ***
s(windspeed)       1 5.99e+06  5987136  231.81 <2e-16 ***
workingday         1 7.46e+04    74612    2.89  0.089 .  
season             3 2.33e+07  7757856  300.37 <2e-16 ***
Residuals      17366 4.49e+08    25828                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
               Npar Df Npar F   Pr(F)    
(Intercept)                              
s(temperature)       3   64.5 < 2e-16 ***
s(windspeed)         3   18.2 9.2e-12 ***
workingday                               
season                                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The significance of the predictor variables did not change after the factorization of workingday and season, but the plots below now look much better, and we can more easily see the difference in effect of each season specifically.

par( mfrow = c(1,4) )
plot( gam1_lin,  se = TRUE, col = "blue" )

We can make predictions from gam objects, just like lm objects, using the predict() method for the class gam. Here we make predictions on some new data. Note that when assigning the value 1 to workingday and 1 to season, we enclose these values in quotes since we told R to treat workingday and season as factors (with two and four levels, respectively).

preds_gam_lin <- predict( gam1_lin, 
                  newdata = data.frame( temperature = 0.40, windspeed = 0.5,
                                        workingday = '1', season = '1' )  )
preds_gam_lin
  1 
172 

Above, we see that the GAM model predicts that on a working winter day with a temperature of 51 degrees Fahrenheit and a normalized windspeed of 0.5, there will be 172 bikes used.

Linear Fit

It is important to note that one can produce a simple linear regression fit using the gam() function by simply not transforming any of the variables with smoothing or other non-linear terms. Below, we fit a simple linear regression to compare with the GAM linear fit above.

library(gam)
# linear fit
linear_mod = gam( count ~ temperature + windspeed + 
                    workingday + season,
                  data = bikes1 )

summary(linear_mod)

Call: gam(formula = count ~ temperature + windspeed + workingday + 
    season, data = bikes1)
Deviance Residuals:
   Min     1Q Median     3Q    Max 
-309.7 -110.9  -29.4   73.6  767.0 

(Dispersion Parameter for gaussian family taken to be 26195)

    Null Deviance: 5.72e+08 on 17378 degrees of freedom
Residual Deviance: 4.55e+08 on 17372 degrees of freedom
AIC: 226131 

Number of Local Scoring Iterations: 2 

Anova for Parametric Effects
               Df   Sum Sq  Mean Sq F value Pr(>F)    
temperature     1 9.37e+07 93677759 3576.11 <2e-16 ***
windspeed       1 6.02e+06  6021342  229.86 <2e-16 ***
workingday      1 4.59e+04    45912    1.75   0.19    
season          3 1.69e+07  5649844  215.68 <2e-16 ***
Residuals   17372 4.55e+08    26195                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We note that the same predictor variables are significant for this model as for the GAM fit; workingday is not significant when all the other variables are in the model.

preds_lin <- predict( linear_mod, 
                  newdata = data.frame( temperature = 0.40, windspeed = 0.5,
                                        workingday = '1', season = '1' )   )
preds_lin
  1 
207 

The predicted number of bikes used on a day with the same conditions as above is much higher when using the simple linear regression than that found when using the GAM linear regression.

Model Comparison

Now let’s compare our regular linear regression fit to the GAM fit with smoothing terms. The following shows how one can extract various measures of performance.

AIC(linear_mod)
[1] 226131
AIC(gam1_lin)
[1] 225891

The AIC values are fairly similar, but the GAM model has the lower AIC, indicating that in this case it may be preferable to the simple linear regression.

anova(linear_mod, gam1_lin, test="Chisq")
Analysis of Deviance Table

Model 1: count ~ temperature + windspeed + workingday + season
Model 2: count ~ s(temperature) + s(windspeed) + workingday + season
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
1     17372   4.55e+08                         
2     17366   4.49e+08  6  6541431   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value much smaller than 0.05, the anova test shows that the smoothing terms in the GAM significantly improve the model over the simple linear regression.
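The mechanics of this test can be reproduced by hand: the drop in deviance, scaled by the dispersion estimate, is compared against a chi-squared distribution with degrees of freedom equal to the extra parameters used by the smoothing terms. A base-R sketch using the numbers reported above (this mirrors, approximately, what anova() computes internally):

```r
# approximate the anova() chi-squared test by hand from the values above
deviance_drop <- 6541431  # Resid. Dev of model 1 minus model 2
dispersion    <- 25828    # dispersion estimate of the larger (GAM) model
df_extra      <- 6        # extra degrees of freedom used by smoothing

chisq_stat <- deviance_drop / dispersion
p_value <- pchisq(chisq_stat, df = df_extra, lower.tail = FALSE)

chisq_stat  # about 253
p_value     # effectively zero, matching Pr(>Chi) < 2e-16
```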

To see the difference between simple linear regression and GAM linear regression, we produce a simpler example below using only windspeed as a predictor for bike count:

lin_mod = gam( count ~ windspeed, data = bikes1 )
gam_mod = gam( count ~ s(windspeed), data = bikes1 )

predlm = predict(lin_mod)
predgam = predict(gam_mod)

ggplot(bikes1, aes(x=windspeed, y=count)) +
  geom_point(alpha=0.2) +
  geom_line(aes(y=predlm, color='SLR'), lwd=1.5) +
  geom_line(aes(y=predgam, color='GAM LR'), lwd=1.5) +
  scale_color_manual(name='Model', values=c("blue", "red"))

As we see in the plot above, the simple linear regression does not account for the non-linear trend in the data that the GAM adjusts for. Using this simple model, the SLR would predict far more bikes being used at high windspeeds than the GAM would, and it would under-predict the number of bikes used at moderate windspeeds. By modeling nonparametric effects, the GAM is often a better fit for real data because predictors rarely have perfectly linear effects.

GAM for Logistic Regression (Predicting Holiday or Not)

For the GAM logistic regression model, we estimated the probability of a given day being a holiday using smoothing parameters.

Likewise, as we noticed earlier in the GAM model for linear regression, the holiday, workingday, and season variables are categorical, so it is preferable to treat them as factors. We first create a second object called bikes2 (in order not to change the initial dataset bikes) and then use the command factor() to convert these three variables. Then we fit the GAM model for logistic regression, with the factored holiday serving as the binary response. As we can see below, gam() now fits a step function for season, which is more appropriate.

## factor holiday, workingday, and season into categorical variables with levels
bikes2 = bikes
bikes2$holiday = factor(bikes$holiday)
bikes2$workingday = factor(bikes2$workingday)
bikes2$season = factor(bikes2$season)

# fit the GAM model for logistic regression
gam_log = gam( holiday ~ s(temperature) + s(windspeed) + 
                    s(count) + season,
                  data = bikes2 , family = binomial)

summary(gam_log)

Call: gam(formula = holiday ~ s(temperature) + s(windspeed) + s(count) + 
    season, family = binomial, data = bikes2)
Deviance Residuals:
   Min     1Q Median     3Q    Max 
-0.433 -0.277 -0.231 -0.191  3.117 

(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 4534 on 17378 degrees of freedom
Residual Deviance: 4390 on 17363 degrees of freedom
AIC: 4422 

Number of Local Scoring Iterations: NA 

Anova for Parametric Effects
                  Df Sum Sq Mean Sq F value Pr(>F)    
s(temperature)     1      7    6.91    7.25 0.0071 ** 
s(windspeed)       1      0    0.30    0.31 0.5774    
s(count)           1      7    6.90    7.23 0.0072 ** 
season             3     43   14.34   15.03  9e-10 ***
Residuals      17363  16563    0.95                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
               Npar Df Npar Chisq  P(Chi)    
(Intercept)                                  
s(temperature)       3       55.6 5.1e-12 ***
s(windspeed)         3        5.5    0.14    
s(count)             3       28.2 3.2e-06 ***
season                                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the output above, the temperature, count, and season variables have a significant effect on the probability of a given day being a holiday, as their p-values are all less than 0.05.

par( mfrow = c(1,4) )
plot( gam_log,  se = TRUE, col = "blue" )

Based on the plots shown above, while windspeed does not have a significant effect on the probability of a day being a holiday, its partial effect spans the largest range on the y-axis among the three numeric variables, which suggests a more flexible, more complex fitted relationship between this predictor and the response.

Now, we can make predictions from gam objects using the predict() method for the class gam. Here we make predictions on some new data, assigning the values 0.80 to temperature, 0.5 to windspeed, 70 to count, and '3' to season, where season is treated as a categorical factor with four levels: "1", "2", "3", and "4".

# prediction with GAM smoothing terms for logistic regression
preds_gam_log <- predict( gam_log, 
                  newdata = data.frame( temperature = 0.80, windspeed = 0.5,
                                        count = 70, season = '3'  )  )
# log-odd
preds_gam_log
    1 
-3.06 
# probability
predicted_probability <- plogis(preds_gam_log)
predicted_probability 
     1 
0.0448 

Assigning the values 0.80 to temperature, 0.5 to windspeed, 70 to count, and 3 to season (summer), the output on the log-odds scale is -3.06, which corresponds to a predicted probability of about 4.48% that the day is a holiday. This is the model's estimated chance of a positive outcome (a holiday) under these conditions.
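The log-odds-to-probability conversion used here is just the inverse logit, which base R provides as plogis(). A quick check with the value reported above:

```r
# convert a log-odds prediction to a probability via the inverse logit:
# plogis(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x))
log_odds <- -3.059871    # value returned by predict() above
prob <- plogis(log_odds)
prob                     # about 0.0448, i.e. a 4.48% chance

# the same value computed directly from the definition
1 / (1 + exp(-log_odds))
```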

# ROC curve
library(pROC)

# predicted probabilities
pred_probs <- fitted(gam_log)

# ROC curve
roc_curve <- roc(bikes2$holiday, pred_probs)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# plot ROC curve
plot(roc_curve, main = "GAM Logistic Fit ROC Curve", col = "blue", lwd = 2)

# add AUC to the plot
auc_value <- auc(roc_curve)
legend("bottomright", legend = paste("AUC =", round(auc_value, 2)), col = "blue", lty = 1, cex = 0.8)

The GAM model has an AUC value of 0.65, which is higher than 0.5. Hence, it indicates that this GAM model discriminates better than random chance.
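The AUC has a concrete interpretation: it is the probability that a randomly chosen holiday receives a higher predicted probability than a randomly chosen non-holiday (ties counting one half). A small base-R illustration with made-up scores, not the bike data:

```r
# AUC as the probability that a random positive outscores a random negative
# (ties count 1/2); toy scores and labels, for illustration only
scores <- c(0.9, 0.8, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# compare every positive-negative pair; 5 of the 6 pairs are ordered correctly
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # 5/6, about 0.83
```

This pairwise formula gives the same value that pROC computes from the ROC curve.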

Logistic Fit

Likewise, we performed a simple logistic regression fit using the gam() function by simply not transforming any of the variables with smoothing or non-linear terms. Below is a simple logistic regression to compare with the GAM logistic fit above.

# logistic fit
log_mod = gam( holiday ~ temperature + windspeed + 
                    count + season,
                  data = bikes2 , family = binomial )

summary(log_mod)

Call: gam(formula = holiday ~ temperature + windspeed + count + season, 
    family = binomial, data = bikes2)
Deviance Residuals:
   Min     1Q Median     3Q    Max 
-0.343 -0.272 -0.230 -0.207  2.928 

(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 4534 on 17378 degrees of freedom
Residual Deviance: 4482 on 17372 degrees of freedom
AIC: 4496 

Number of Local Scoring Iterations: 6 

Anova for Parametric Effects
               Df Sum Sq Mean Sq F value  Pr(>F)    
temperature     1     12   11.72   11.83 0.00059 ***
windspeed       1      0    0.26    0.27 0.60658    
count           1      9    8.78    8.86 0.00293 ** 
season          3     28    9.41    9.49 2.9e-06 ***
Residuals   17372  17222    0.99                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With this model, it is also found that the temperature, count, and season variables have a significant effect on the probability of a day being a holiday, as their p-values are all less than 0.05.

To keep the comparison between these two logistic models consistent, we made predictions on the same new data using the predict() method for the class gam, assigning the values 0.80 to temperature, 0.5 to windspeed, 70 to count, and '3' to season, where season is treated as a categorical factor with four levels: "1", "2", "3", and "4".

# prediction with logistic fit
preds_log <- predict( log_mod, 
                  newdata = data.frame( temperature = 0.80, windspeed = 0.5,
                                        count = 70, season = '3'  )  )
# log-odd
preds_log
    1 
-3.49 
# probability
predicted_probability <- plogis(preds_log)
predicted_probability 
     1 
0.0295 

Assigning the values 0.80 to temperature, 0.5 to windspeed, 70 to count, and 3 to season, the output on the log-odds scale is -3.49, corresponding to a predicted probability of about 2.95% that the day is a holiday, which is slightly lower than the probability from the model with smoothing parameters.

# ROC-AUC curve

# predicted probabilities on the training data for logistic fit
predicted_probabilities_log <- predict(log_mod, type = "response")

# ROC curve 
roc_curve_log <- roc(bikes2$holiday, predicted_probabilities_log)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# plot the ROC curve
plot(roc_curve_log, main = "Simple Logistic Fit ROC Curve", col = "green", lwd = 2)

# add AUC to the plot 
auc_value_log <- auc(roc_curve_log)
legend("bottomright", legend = paste("AUC =", round(auc_value_log, 2)), col = "green", lty = 1, cex = 0.8)

In contrast, this simple logistic regression model, which applies no smoothing or non-linear transformation to any of the variables, has an AUC value of 0.59. This is only slightly above 0.5, indicating a marginally better-than-random ability to distinguish between the classes and weaker discrimination than the GAM fit.

# convert predicted probabilities to binary predictions (0 or 1)
predicted_classes <- ifelse(pred_probs > 0.05, 1, 0)

# confusion matrix
confusion_matrix <- table(Actual = bikes2$holiday, Predicted = predicted_classes)

confusion_matrix
      Predicted
Actual     0     1
     0 15930   949
     1   430    70

With the initial threshold of 0.5 in the GAM, the confusion matrix showed True Negatives (TN) of 16879, False Positives (FP) of 0, False Negatives (FN) of 500, and True Positives (TP) of 0, so the threshold was clearly too high. Hence, a new threshold of 0.05 was picked. Although the positive class (1) is underrepresented, which keeps TP relatively small, the updated confusion matrix is a slight improvement.
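From a confusion matrix like the one above, class-wise performance measures follow directly. A base-R sketch using the counts reported at the 0.05 threshold:

```r
# class-wise metrics from the 0.05-threshold confusion matrix above
TN <- 15930; FP <- 949; FN <- 430; TP <- 70

sensitivity <- TP / (TP + FN)  # true positive rate: 70 / 500 = 0.14
specificity <- TN / (TN + FP)  # true negative rate: 15930 / 16879
accuracy    <- (TP + TN) / (TP + TN + FP + FN)

c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```

Lowering the threshold trades specificity for sensitivity, which is why the 0.05 cutoff finally recovers some true positives.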

Model Comparison

Now let’s compare our regular logistic regression fit to the GAM fit with smoothing terms. The following shows how one can extract various measures of performance.

AIC(log_mod)
[1] 4496
AIC(gam_log)
[1] 4422

The AIC values are fairly similar, but the GAM model has the lower AIC, indicating that in this case it may be preferable to the simple logistic regression.

anova(log_mod, gam_log, test="Chisq")
Analysis of Deviance Table

Model 1: holiday ~ temperature + windspeed + count + season
Model 2: holiday ~ s(temperature) + s(windspeed) + s(count) + season
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
1     17372       4482                         
2     17363       4390  9     92.9  4.3e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value much smaller than 0.05, the anova test shows that the GAM with smoothing parameters significantly improves the model over the simple logistic regression. Although neither logistic model captures the holiday outcome especially well, using the GAM with smoothing parameters offers some improvement in predicting our response variable.

GAM vs. GLM

GLMs, or generalized linear models, are a class of statistical models that extend ordinary linear models to handle various types of data distributions and relationships by using link functions and specific error distributions. They’re versatile, accommodating regression and classification problems, but assume a linear relationship between predictors and the response variable on a transformed scale. On the other hand, GAMs, or generalized additive models, take this further by allowing for non-linear relationships between predictors and responses using smoothing functions like splines. GAMs offer more flexibility when the relationships between variables are complex and not adequately captured by a linear model, allowing for a better fit in such scenarios. Ultimately, the choice between GLMs and GAMs depends on the complexity of the relationship one seeks to model in the data.

Mathematical Representation of GAM: \[ g(E[y|x]) = \beta_0 + f_1(x_1) + ... + f_p(x_p) \]

Mathematical Representation of GLM: \[ g(E[y|x]) = \beta_0 + \beta_1x_1 + ... + \beta_px_p \]

Key components of GLMs include:

  1. Linear Predictor: Similar to linear regression, GLMs use a linear combination of predictors (independent variables) to model the relationship with the dependent variable. However, GLMs can handle more than just continuous outcomes; they can work with binary, count, or other non-normally distributed data.

  2. Link Function: GLMs introduce a link function that connects the linear predictor to the response variable. This link function is used to model the relationship between the expected value of the response variable and the linear predictor.

  3. Error Distribution: GLMs allow for different types of error distributions beyond the normal distribution used in simple linear regression. These could include the Poisson distribution for count data, binomial distribution for binary outcomes, or gamma distribution for continuous positive outcomes.

Common examples of GLMs include:

Linear Regression: When the response variable is continuous and normally distributed.

Logistic Regression: Used for binary classification problems where the response variable is categorical (typically binary).

Poisson Regression: Suitable for count data, where the response variable represents counts of events occurring in a fixed interval of time or space.

GLM for Linear Regression

glm_lin = glm( count ~ temperature + windspeed + 
                    workingday + season,
                  data = bikes, family=poisson)

summary(glm_lin)

Call:
glm(formula = count ~ temperature + windspeed + workingday + 
    season, family = poisson, data = bikes)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 3.757438   0.002546  1476.1   <2e-16 ***
temperature 1.948511   0.003058   637.1   <2e-16 ***
windspeed   0.912606   0.004537   201.1   <2e-16 ***
workingday  0.023702   0.001199    19.8   <2e-16 ***
season      0.095868   0.000586   163.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2891591  on 17378  degrees of freedom
Residual deviance: 2332594  on 17374  degrees of freedom
AIC: 2443505

Number of Fisher Scoring iterations: 5

The summary above shows that all the predictors are significant, with p-values < 0.05.
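Because this Poisson GLM uses a log link, coefficients are additive on the log scale, so exponentiating a coefficient gives a multiplicative effect on the expected count. A quick check with the temperature estimate reported in the summary above:

```r
# Poisson GLM coefficients live on the log scale; exp() converts them
# to multiplicative effects on the expected count (rate ratios).
beta_temp <- 1.948511   # temperature estimate from summary(glm_lin)
exp(beta_temp)          # ~7.02: a one-unit increase in normalized
                        # temperature multiplies expected rentals by ~7
```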

preds_glm <- predict(glm_lin, 
                  newdata = data.frame( temperature = 0.40, windspeed = 0.5,
                                        workingday = 1, season = 1))
preds_glm
   1 
5.11 

Note that `predict()` on a GLM returns the linear predictor by default, which for this Poisson model is the log of the expected count. The value 5.11 above therefore corresponds to exp(5.11) ≈ 166 expected rentals on a working winter day with a normalized temperature of 0.40 and a normalized windspeed of 0.5; `predict(..., type = "response")` would return the count scale directly. This prediction differs from the one obtained using the GAM model.
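The link-scale default can be verified on simulated data: exponentiating the link-scale prediction recovers the response-scale expected count. This is a self-contained sketch (the variable names are illustrative, and the `bikes` data is not needed):

```r
set.seed(42)
d <- data.frame(x = runif(100))
d$y <- rpois(100, exp(1 + 2 * d$x))

fit <- glm(y ~ x, family = poisson, data = d)

eta <- predict(fit, type = "link")      # default: log scale
mu  <- predict(fit, type = "response")  # expected counts

all.equal(exp(eta), mu)                 # TRUE: response = exp(link)
```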

AIC(glm_lin)
[1] 2443505

The GAM model has a lower AIC, indicating that in this case a GAM may be the better choice.

anova(glm_lin, gam1_lin, test="Chisq")
Analysis of Deviance Table

Model 1: count ~ temperature + windspeed + workingday + season
Model 2: count ~ s(temperature) + s(windspeed) + workingday + season
  Resid. Df Resid. Dev Df  Deviance Pr(>Chi)
1     17374   2.33e+06                      
2     17366   4.49e+08  8 -4.46e+08         
glm_mod = glm( count ~ windspeed, data = bikes1 )
gam_mod = gam( count ~ s(windspeed), data = bikes1 )

predglm = predict(glm_mod)
predgam = predict(gam_mod)

ggplot(bikes1, aes(x=windspeed, y=count)) +
  geom_point(alpha=0.2) +
  geom_line(aes(y=predglm, color='GLM'), lwd=1.5) +
  geom_line(aes(y=predgam, color='GAM LR'), lwd=1.5) +
  scale_color_manual(name='Model', values=c("blue", "red"))


As the plot above shows, the GLM regression line does not capture the non-linear trend in the data that the GAM model adjusts for.
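The same pattern can be reproduced on simulated data with a clearly non-linear signal. This sketch assumes the `mgcv` package (which supplies the `gam()` and `s()` used throughout); the data and names are invented:

```r
library(mgcv)   # provides gam() and s()

set.seed(7)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)

lin_fit <- glm(y ~ x)      # straight line: misses the curvature
smo_fit <- gam(y ~ s(x))   # spline smooth: tracks it

c(glm = deviance(lin_fit), gam = deviance(smo_fit))
```

The GAM's residual deviance is far smaller because the spline adapts to the sinusoidal shape that a straight line cannot follow.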

GLM for Logistic Regression

glm_log = glm( holiday ~ temperature + windspeed + 
                    count + season,
                  data = bikes2, family=binomial)

summary(glm_log)

Call:
glm(formula = holiday ~ temperature + windspeed + count + season, 
    family = binomial, data = bikes2)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.43618    0.15907  -21.60  < 2e-16 ***
temperature  1.01814    0.40864    2.49  0.01272 *  
windspeed    0.18503    0.37084    0.50  0.61782    
count       -0.00118    0.00032   -3.67  0.00024 ***
season2     -0.73533    0.16126   -4.56  5.1e-06 ***
season3     -0.88289    0.20460   -4.32  1.6e-05 ***
season4     -0.15732    0.12745   -1.23  0.21706    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4533.9  on 17378  degrees of freedom
Residual deviance: 4482.5  on 17372  degrees of freedom
AIC: 4496

Number of Fisher Scoring iterations: 6

Temperature, count, and the spring and summer season indicators (season2, season3) have significant effects on the probability of a day being a holiday, with p-values less than 0.05; windspeed and season4 are not significant at that level.

preds_glm_log <- predict( glm_log, 
                  newdata = data.frame( temperature = 0.80, windspeed = 0.5,
                                        count = 70, season = '3'  )  )
# log-odd
preds_glm_log
    1 
-3.49 
# probability
predicted_probability <- plogis(preds_glm_log)
predicted_probability 
     1 
0.0295 

Assigning the values 0.80 to temperature, 0.5 to windspeed, 70 to count, and 3 (summer) to season, the output on the log-odds scale is -3.49445, which corresponds to a predicted probability of about 2.947% that the day is a holiday. This prediction is smaller than the prediction obtained using a GAM model.
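The conversion from log-odds to probability is the inverse logit, which `plogis()` implements; it can be checked by hand with the value above:

```r
eta <- -3.49445                   # log-odds from the GLM prediction
p_manual <- 1 / (1 + exp(-eta))   # inverse logit computed by hand
p_plogis <- plogis(eta)           # same conversion via plogis()

c(manual = p_manual, plogis = p_plogis)   # both ~0.0295
```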

AIC(glm_log)
[1] 4496

The AIC values are similar, indicating that either a GAM or a GLM may be appropriate for this model.

anova(glm_log, gam_log, test="Chisq")
Analysis of Deviance Table

Model 1: holiday ~ temperature + windspeed + count + season
Model 2: holiday ~ s(temperature) + s(windspeed) + s(count) + season
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
1     17372       4482                         
2     17363       4390  9     92.9  4.3e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value much smaller than 0.05, the anova test shows that the GAM model with smoothing terms significantly improves the fit over the GLM model.
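The p-value in a deviance table like the one above comes from referring the deviance reduction to a chi-square distribution with the corresponding degrees of freedom; using the printed values:

```r
dev_drop <- 92.9   # deviance reduction from the anova table above
df_drop  <- 9      # extra effective degrees of freedom used by the smooths

# Upper-tail chi-square probability reproduces the reported ~4.3e-16
p_val <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
p_val
```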

# predicted probabilities
predict_prob_glm <- predict(glm_log, type = "response")

# ROC curve
roc_curve <- roc(bikes2$holiday, predict_prob_glm)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# plot ROC curve
plot(roc_curve, main = "GLM Logistic Fit ROC Curve", col = "blue", lwd = 2)
# add AUC to the plot
auc_value <- auc(roc_curve)
legend("bottomright", legend = paste("AUC =", round(auc_value, 2)), col = "blue", lty = 1, cex = 0.8)

The GLM model has an AUC of 0.59, which is above 0.5, indicating that it performs somewhat better than random guessing.
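An AUC of 0.59 means the model assigns a higher score to a randomly chosen holiday than to a randomly chosen non-holiday about 59% of the time. This rank interpretation can be checked by hand without pROC, on a tiny invented example:

```r
# AUC = probability that a random positive outscores a random negative
# (ties count half). Scores and labels below are made up for the sketch.
scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(0,   0,   1,    1)

pos <- scores[labels == 1]
neg <- scores[labels == 0]
cmp <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
mean(cmp)   # 0.75: three of the four positive/negative pairs are ranked correctly
```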


# Function to convert probabilities to binary predictions
convert_to_binary <- function(probabilities, threshold = 0.05) {
  predictions <- ifelse(probabilities >= threshold, 1, 0)
  return(predictions)
}

# Make predictions using the logistic regression model
preds_glm_log <- predict(glm_log, 
                         type = "response")

# Convert predicted probabilities to binary predictions (0 or 1)
binary_preds <- convert_to_binary(preds_glm_log)


# Create a confusion matrix
conf_matrix <- table(Actual = bikes2$holiday, Predicted = binary_preds)

# Display the confusion matrix
print(conf_matrix)
      Predicted
Actual     0     1
     0 16778   101
     1   500     0

Although the threshold was lowered to 0.05, the GLM logistic model produced no true positives (TP). The GAM logistic model did yield true positives at the same threshold, suggesting that the GAM may be better suited to this particular dataset.
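The confusion matrix above can be summarized as sensitivity and specificity, which makes the failure mode explicit:

```r
# Cell counts read off the confusion matrix above
TN <- 16778; FP <- 101
FN <- 500;   TP <- 0

sensitivity <- TP / (TP + FN)   # 0: the model catches no holidays
specificity <- TN / (TN + FP)   # ~0.994: non-holidays are almost always correct
c(sensitivity = sensitivity, specificity = specificity)
```

With holidays making up only about 3% of days, a classifier can achieve high overall accuracy while having zero sensitivity, which is exactly what the GLM does here.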

Final Thoughts

In conclusion, this report has delved into the application of generalized additive models (GAMs) in the context of linear regression and logistic regression, providing a comprehensive exploration of their capabilities and a comparative analysis against generalized linear models (GLMs). The findings and insights derived from this study highlight several key points that contribute to the understanding of the efficacy of GAMs in various regression tasks.

Firstly, the flexibility of GAMs in capturing complex non-linear relationships was evident throughout the analysis. The ability to incorporate smooth functions of predictors allows GAMs to adapt to intricate data patterns, providing a more nuanced representation of relationships compared to the linearity assumptions in traditional GLMs. This flexibility is particularly valuable when dealing with datasets where relationships are not easily characterized by linear models. Moreover, the comparison against GLMs revealed that GAMs outperformed in scenarios with non-linear dependencies. The improved predictive accuracy of GAMs showcases their advantage in capturing the inherent complexities present in real-world datasets.

However, it is crucial to acknowledge the trade-offs associated with the increased flexibility of GAMs. The interpretability of the model can be compromised, and careful consideration is required in selecting the appropriate smoothing functions to avoid overfitting. Furthermore, the potential increase in model complexity demands thoughtful validation procedures to ensure the generalizability of the model to new data.

In summary, the application of generalized additive models has demonstrated their efficacy in capturing non-linear relationships in both linear regression and logistic regression scenarios. The comparison against generalized linear models reveals the significance of employing GAMs when dealing with complex, non-linear dependencies. As with any modeling approach, a nuanced understanding of the data and thoughtful consideration of model complexities are essential for maximizing the benefits of GAMs in regression tasks. The findings presented here contribute to the growing body of knowledge on advanced regression techniques, offering practitioners valuable insights into the utility of GAMs in capturing the intricacies of real-world data.